Chapter 1. Introduction

Chapter 2. Preparation

Import data

Before importing, we preprocessed the raw data in Python to obtain a tidier data frame, since the raw data contains several columns stored as JSON strings with many nested attributes.

## 'data.frame':    4803 obs. of  12 variables:
##  $ X                   : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ budget              : int  237000000 300000000 245000000 250000000 260000000 258000000 260000000 280000000 250000000 250000000 ...
##  $ genres              : Factor w/ 21 levels "","Action","Adventure",..: 2 3 2 2 2 10 4 2 3 2 ...
##  $ popularity          : num  150.4 139.1 107.4 112.3 43.9 ...
##  $ production_companies: Factor w/ 1314 levels "","100 Bares",..: 615 1263 265 696 1263 265 1263 758 1267 320 ...
##  $ release_date        : Factor w/ 3281 levels "","1916-09-04",..: 2315 1945 3185 2688 2635 1940 2450 3111 2246 3234 ...
##  $ revenue             : num  2.79e+09 9.61e+08 8.81e+08 1.08e+09 2.84e+08 ...
##  $ runtime             : num  162 169 148 165 132 139 100 141 153 151 ...
##  $ title               : Factor w/ 4800 levels "(500) Days of Summer",..: 381 2653 3186 3614 1906 3198 3364 382 1587 444 ...
##  $ vote_average        : num  7.2 6.9 6.3 7.6 6.1 5.9 7.4 7.3 7.4 5.7 ...
##  $ vote_count          : int  11800 4500 4466 9106 2124 3576 3330 6767 5293 7004 ...
##  $ Number_Genres       : int  4 3 3 4 3 3 2 3 3 3 ...

Data Cleaning

## 'data.frame':    3225 obs. of  15 variables:
##  $ budget    : int  237000000 300000000 245000000 250000000 260000000 258000000 260000000 280000000 250000000 250000000 ...
##  $ genres    : Factor w/ 18 levels "Action","Adventure",..: 1 2 1 1 1 9 3 1 2 1 ...
##  $ popularity: num  150.4 139.1 107.4 112.3 43.9 ...
##  $ company   : Factor w/ 6 levels "Others","Paramount Pictures",..: 1 5 3 1 5 3 5 5 6 6 ...
##  $ date      : Date, format: "2009-12-10" "2007-05-19" ...
##  $ revenue   : num  2.79e+09 9.61e+08 8.81e+08 1.08e+09 2.84e+08 ...
##  $ runtime   : num  162 169 148 165 132 139 100 141 153 151 ...
##  $ title     : Factor w/ 3224 levels "(500) Days of Summer",..: 259 1761 2129 2420 1265 2139 2256 260 1053 310 ...
##  $ score     : num  7.2 6.9 6.3 7.6 6.1 5.9 7.4 7.3 7.4 5.7 ...
##  $ vote      : int  11800 4500 4466 9106 2124 3576 3330 6767 5293 7004 ...
##  $ profit    : num  2.55e+09 6.61e+08 6.36e+08 8.35e+08 2.41e+07 ...
##  $ profitable: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ season    : Factor w/ 4 levels "Spring","Summer",..: 4 1 3 2 1 1 3 1 2 1 ...
##  $ quarter   : Factor w/ 4 levels "Q1","Q2","Q3",..: 4 2 4 3 1 2 4 2 3 1 ...
##  $ year      : num  2009 2007 2015 2012 2012 ...

Data Summary

##      budget               genres      popularity 
##  Min.   :1.00e+00   Drama    :745   Min.   :  0  
##  1st Qu.:1.05e+07   Comedy   :634   1st Qu.: 10  
##  Median :2.50e+07   Action   :588   Median : 20  
##  Mean   :4.07e+07   Adventure:288   Mean   : 29  
##  3rd Qu.:5.50e+07   Horror   :197   3rd Qu.: 37  
##  Max.   :3.80e+08   Crime    :141   Max.   :876  
##                     (Other)  :632                
##                company          date               revenue        
##  Others            :1636   Min.   :1916-09-04   Min.   :5.00e+00  
##  Paramount Pictures: 255   1st Qu.:1998-09-10   1st Qu.:1.71e+07  
##  Sony Pictures     : 277   Median :2005-07-20   Median :5.52e+07  
##  Universal Pictures: 338   Mean   :2002-03-18   Mean   :1.21e+08  
##  Walt Disney       : 497   3rd Qu.:2010-11-11   3rd Qu.:1.46e+08  
##  Warner Bros       : 222   Max.   :2016-09-09   Max.   :2.79e+09  
##                                                                   
##     runtime                           title          score     
##  Min.   : 41   The Host                  :   2   Min.   :2.30  
##  1st Qu.: 96   (500) Days of Summer      :   1   1st Qu.:5.80  
##  Median :107   [REC]                     :   1   Median :6.30  
##  Mean   :111   [REC]²                    :   1   Mean   :6.31  
##  3rd Qu.:121   10 Cloverfield Lane       :   1   3rd Qu.:6.90  
##  Max.   :338   10 Things I Hate About You:   1   Max.   :8.50  
##                (Other)                   :3218                 
##       vote           profit          profitable    season    quarter 
##  Min.   :    1   Min.   :-1.66e+08   0: 787     Spring:704   Q1:656  
##  1st Qu.:  179   1st Qu.: 2.52e+05   1:2438     Summer:837   Q2:757  
##  Median :  471   Median : 2.64e+07              Fall  :930   Q3:931  
##  Mean   :  978   Mean   : 8.07e+07              Winter:754   Q4:881  
##  3rd Qu.: 1148   3rd Qu.: 9.75e+07                                   
##  Max.   :13752   Max.   : 2.55e+09                                   
##                                                                      
##       year     
##  Min.   :1916  
##  1st Qu.:1998  
##  Median :2005  
##  Mean   :2002  
##  3rd Qu.:2010  
##  Max.   :2016  
## 
##     revenue             budget           popularity     runtime   
##  Min.   :5.00e+00   Min.   :1.00e+00   Min.   :  0   Min.   : 41  
##  1st Qu.:1.71e+07   1st Qu.:1.05e+07   1st Qu.: 10   1st Qu.: 96  
##  Median :5.52e+07   Median :2.50e+07   Median : 20   Median :107  
##  Mean   :1.21e+08   Mean   :4.07e+07   Mean   : 29   Mean   :111  
##  3rd Qu.:1.46e+08   3rd Qu.:5.50e+07   3rd Qu.: 37   3rd Qu.:121  
##  Max.   :2.79e+09   Max.   :3.80e+08   Max.   :876   Max.   :338  
##      score           vote           profit         
##  Min.   :2.30   Min.   :    1   Min.   :-1.66e+08  
##  1st Qu.:5.80   1st Qu.:  179   1st Qu.: 2.52e+05  
##  Median :6.30   Median :  471   Median : 2.64e+07  
##  Mean   :6.31   Mean   :  978   Mean   : 8.07e+07  
##  3rd Qu.:6.90   3rd Qu.: 1148   3rd Qu.: 9.75e+07  
##  Max.   :8.50   Max.   :13752   Max.   : 2.55e+09

Variance and SD

##    revenue     budget popularity    runtime      score       vote 
##   1.86e+08   4.44e+07   3.62e+01   2.10e+01   8.60e-01   1.41e+03 
##     profit 
##   1.58e+08
##    revenue     budget popularity    runtime      score       vote 
##   3.47e+16   1.97e+15   1.31e+03   4.40e+02   7.39e-01   2.00e+06 
##     profit 
##   2.50e+16

The means, variances, and standard deviations differ considerably across variables because most of them are on different scales. We need to scale the data for some models such as linear regression, PCR, KNN, etc.
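As a sketch of this scaling step (the data frame below is a toy stand-in for the numeric columns of our cleaned data), R's scale() standardizes each column to mean 0 and standard deviation 1:

```r
# Toy stand-in for the numeric columns of the cleaned data
df <- data.frame(budget     = c(2.37e8, 1.05e7, 2.5e7, 5.5e7),
                 popularity = c(150.4, 10.2, 20.7, 37.1),
                 runtime    = c(162, 96, 107, 121))

# scale() centers each column to mean 0 and rescales it to sd 1
df_scaled <- as.data.frame(scale(df))
round(colMeans(df_scaled), 10)  # ~0 for every column
apply(df_scaled, 2, sd)         # 1 for every column
```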

Data visualization

Number of movies by company

Number of movies by season

Chapter 3. Revenue Prediction

Dependency

Numerical Variables

Genre

Test the frequency distributions of revenue across different genres

Overall, there is evidence that the frequency distributions of revenue differ across genres. Revenue seems to depend on genre.
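The specific test is not shown in the output above; a Kruskal-Wallis test is one common choice for comparing revenue distributions across groups, sketched here on toy data:

```r
set.seed(1)
toy <- data.frame(
  revenue = c(rlnorm(50, meanlog = 17), rlnorm(50, meanlog = 18)),
  genres  = factor(rep(c("Drama", "Action"), each = 50))
)

# Nonparametric test of whether revenue distributions differ across genres
kruskal.test(revenue ~ genres, data = toy)
```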

Company

Check the frequency distributions of revenue across different companies

Overall, there is evidence that the frequency distributions of revenue differ across companies. Revenue seems to depend on company.

Season

It seems that winter and fall form one group, and spring and summer form another.

Model

Training and testing sets
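A minimal sketch of the split, using a toy data frame in place of our cleaned data (the report trains on 2176 of the 3225 rows, roughly two thirds):

```r
set.seed(42)
toy <- data.frame(x = rnorm(100), revenue = rnorm(100))

# Hold out roughly a third of the rows for testing
idx   <- sample(nrow(toy), size = round(2/3 * nrow(toy)))
train <- toy[idx, ]
test  <- toy[-idx, ]
```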

Linear model

Numerical Variables

Model Construction

Construct the model on the training set (using all numerical variables)

## 
## Call:
## lm(formula = revenue ~ ., data = train1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -6.21e+08 -3.89e+07 -1.92e+06  2.46e+07  1.60e+09 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 122161921    2168480   56.34  < 2e-16 ***
## budget       82502895    2822244   29.23  < 2e-16 ***
## popularity   14588304    2952684    4.94  8.4e-07 ***
## runtime      -1265467    2415512   -0.52     0.60    
## score          212212    2648862    0.08     0.94    
## vote         85807055    3723449   23.05  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.01e+08 on 2170 degrees of freedom
## Multiple R-squared:  0.711,  Adjusted R-squared:  0.711 
## F-statistic: 1.07e+03 on 5 and 2170 DF,  p-value: <2e-16
##     budget popularity    runtime      score       vote 
##       1.68       2.15       1.26       1.50       2.95
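The last row of numbers above are variance inflation factors (likely produced by car::vif). A sketch on toy data, computing one VIF by hand so no extra package is needed:

```r
set.seed(7)
toy <- data.frame(budget = rnorm(200), vote = rnorm(200))
toy$revenue <- 2 * toy$budget + 1.5 * toy$vote + rnorm(200)

# Fit the linear model on all predictors, as in lm(revenue ~ ., data = train1)
fit <- lm(revenue ~ ., data = toy)

# VIF of a predictor = 1 / (1 - R^2) from regressing it on the other predictors
r2 <- summary(lm(budget ~ vote, data = toy))$r.squared
vif_budget <- 1 / (1 - r2)   # close to 1 here: the toy predictors are independent
```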

Prediction

Testing

##      mae     rmse 
## 6.25e+07 1.06e+08

Training

##      mae     rmse 
## 5.86e+07 1.01e+08
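The mae/rmse pairs above can be reproduced with two small helper functions; the names here are our own, not necessarily the ones used in the analysis:

```r
# Hypothetical helpers matching the mae/rmse pairs reported throughout
mae  <- function(actual, pred) mean(abs(actual - pred))
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

c(mae  = mae(c(100, 200, 300), c(110, 190, 330)),
  rmse = rmse(c(100, 200, 300), c(110, 190, 330)))
```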

Feature selection

All three feature selection methods show that predictors (budget + popularity + vote) form the best model.
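The three methods are not named in the output; backward stepwise selection by AIC via step() is one plausible candidate, sketched here on toy data (forward and best-subset selection, e.g. leaps::regsubsets, work analogously):

```r
set.seed(3)
toy <- data.frame(budget = rnorm(300), popularity = rnorm(300),
                  runtime = rnorm(300), score = rnorm(300), vote = rnorm(300))
toy$revenue <- 3 * toy$budget + 2 * toy$vote + toy$popularity + rnorm(300)

full <- lm(revenue ~ ., data = toy)
best <- step(full, direction = "backward", trace = 0)  # drop terms by AIC
formula(best)
```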

Best Model

Construct the model on train set

## 
## Call:
## lm(formula = revenue ~ budget + popularity + vote, data = train1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -6.21e+08 -3.85e+07 -2.19e+06  2.44e+07  1.60e+09 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.22e+08   2.17e+06   56.36  < 2e-16 ***
## budget      8.23e+07   2.62e+06   31.45  < 2e-16 ***
## popularity  1.46e+07   2.95e+06    4.96  7.8e-07 ***
## vote        8.57e+07   3.47e+06   24.72  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.01e+08 on 2172 degrees of freedom
## Multiple R-squared:  0.711,  Adjusted R-squared:  0.711 
## F-statistic: 1.78e+03 on 3 and 2172 DF,  p-value: <2e-16
##     budget popularity       vote 
##       1.45       2.15       2.55

Prediction

##      mae     rmse 
## 6.25e+07 1.06e+08
##      mae     rmse 
## 5.86e+07 1.01e+08

No change compared to the model containing all numerical variables. We can remove the unnecessary variables without reducing adjusted R-squared or increasing RMSE. The best model is less complex and less prone to overfitting from having many predictors (high dimensionality).

Categorical and Numerical Variables

Model Construction

## 
## Call:
## lm(formula = revenue ~ ., data = train1_full)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -6.08e+08 -4.03e+07 -1.43e+06  2.90e+07  1.62e+09 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               108110505    6817790   15.86  < 2e-16 ***
## budget                     77348867    3033574   25.50  < 2e-16 ***
## popularity                 13617667    2932282    4.64  3.6e-06 ***
## runtime                     5900196    2594312    2.27  0.02305 *  
## score                      -1602884    2751829   -0.58  0.56030    
## vote                       87028064    3716646   23.42  < 2e-16 ***
## genresAdventure            14998048    8669568    1.73  0.08378 .  
## genresAnimation            87790816   13236699    6.63  4.2e-11 ***
## genresComedy               25268186    7163941    3.53  0.00043 ***
## genresCrime               -10102172   11467478   -0.88  0.37845    
## genresDocumentary          44711638   23188936    1.93  0.05397 .  
## genresDrama                 4923647    7294556    0.67  0.49976    
## genresFamily               82467003   20391297    4.04  5.4e-05 ***
## genresFantasy               2870822   13660650    0.21  0.83357    
## genresHistory              15689738   24998838    0.63  0.53032    
## genresHorror               17619142   10531825    1.67  0.09448 .  
## genresMusic                23079584   29354497    0.79  0.43182    
## genresMystery               6103862   24024211    0.25  0.79946    
## genresRomance              17057626   14926772    1.14  0.25327    
## genresScience Fiction     -15440346   15106799   -1.02  0.30686    
## genresThriller             -7679779   12337527   -0.62  0.53370    
## genresWar                 -56563625   33719837   -1.68  0.09360 .  
## genresWestern              -3456414   24929422   -0.14  0.88974    
## companyParamount Pictures  19120256    8303147    2.30  0.02139 *  
## companySony Pictures        4555821    7970621    0.57  0.56767    
## companyUniversal Pictures  16743343    7455117    2.25  0.02481 *  
## companyWalt Disney         16706288    6373478    2.62  0.00882 ** 
## companyWarner Bros          -535044    8559594   -0.06  0.95016    
## seasonSummer                -822742    6192423   -0.13  0.89431    
## seasonFall                 -9929435    6117507   -1.62  0.10471    
## seasonWinter               -5891796    6282550   -0.94  0.34845    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 99300000 on 2145 degrees of freedom
## Multiple R-squared:  0.725,  Adjusted R-squared:  0.721 
## F-statistic:  188 on 30 and 2145 DF,  p-value: <2e-16
##                    budget                popularity 
##                      2.01                      2.20 
##                   runtime                     score 
##                      1.51                      1.68 
##                      vote           genresAdventure 
##                      3.04                      1.40 
##           genresAnimation              genresComedy 
##                      1.25                      1.79 
##               genresCrime         genresDocumentary 
##                      1.26                      1.08 
##               genresDrama              genresFamily 
##                      2.06                      1.08 
##             genresFantasy             genresHistory 
##                      1.14                      1.07 
##              genresHorror               genresMusic 
##                      1.32                      1.04 
##             genresMystery             genresRomance 
##                      1.04                      1.13 
##     genresScience Fiction            genresThriller 
##                      1.11                      1.18 
##                 genresWar             genresWestern 
##                      1.03                      1.06 
## companyParamount Pictures      companySony Pictures 
##                      1.09                      1.11 
## companyUniversal Pictures        companyWalt Disney 
##                      1.12                      1.18 
##        companyWarner Bros              seasonSummer 
##                      1.08                      1.61 
##                seasonFall              seasonWinter 
##                      1.65                      1.60

The p-values and t-values indicate no significant differences among seasons, so season does not seem to be a necessary predictor.

Feature Selection

When including season, genre, and company in the model, the best numerical predictors are still budget, popularity, and vote. The effects of the different seasons do not appear significant. We will build the model with these 3 numerical predictors and 2 categorical variables: genre and company.

Best Model

## 
## Call:
## lm(formula = revenue ~ budget + vote + company + genres + popularity, 
##     data = train1_full)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -6.14e+08 -4.04e+07 -8.25e+05  2.96e+07  1.62e+09 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               103783104    5493099   18.89  < 2e-16 ***
## budget                     79307464    2810718   28.22  < 2e-16 ***
## vote                       87343750    3432074   25.45  < 2e-16 ***
## companyParamount Pictures  20187352    8296723    2.43  0.01505 *  
## companySony Pictures        4706890    7968099    0.59  0.55477    
## companyUniversal Pictures  17875529    7436574    2.40  0.01631 *  
## companyWalt Disney         16364447    6369330    2.57  0.01026 *  
## companyWarner Bros          -135185    8558273   -0.02  0.98740    
## genresAdventure            14924242    8645045    1.73  0.08443 .  
## genresAnimation            80203108   12811282    6.26  4.6e-10 ***
## genresComedy               24197175    7152449    3.38  0.00073 ***
## genresCrime                -8821228   11302914   -0.78  0.43522    
## genresDocumentary          40500854   22990580    1.76  0.07827 .  
## genresDrama                 6560315    6935298    0.95  0.34429    
## genresFamily               78112299   20296238    3.85  0.00012 ***
## genresFantasy               1516100   13618042    0.11  0.91136    
## genresHistory              22335236   24701813    0.90  0.36599    
## genresHorror               15897705   10491528    1.52  0.12985    
## genresMusic                22199706   29264598    0.76  0.44818    
## genresMystery               3544411   24017226    0.15  0.88269    
## genresRomance              16669177   14880624    1.12  0.26276    
## genresScience Fiction     -15795028   15101070   -1.05  0.29570    
## genresThriller             -7928898   12337416   -0.64  0.52051    
## genresWar                 -55041648   33646312   -1.64  0.10201    
## genresWestern                726641   24731085    0.03  0.97656    
## popularity                 13447493    2930278    4.59  4.7e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 99400000 on 2150 degrees of freedom
## Multiple R-squared:  0.724,  Adjusted R-squared:  0.721 
## F-statistic:  225 on 25 and 2150 DF,  p-value: <2e-16
##                    budget                      vote 
##                      1.73                      2.59 
## companyParamount Pictures      companySony Pictures 
##                      1.09                      1.11 
## companyUniversal Pictures        companyWalt Disney 
##                      1.11                      1.17 
##        companyWarner Bros           genresAdventure 
##                      1.08                      1.39 
##           genresAnimation              genresComedy 
##                      1.17                      1.78 
##               genresCrime         genresDocumentary 
##                      1.22                      1.06 
##               genresDrama              genresFamily 
##                      1.86                      1.07 
##             genresFantasy             genresHistory 
##                      1.13                      1.04 
##              genresHorror               genresMusic 
##                      1.30                      1.03 
##             genresMystery             genresRomance 
##                      1.04                      1.12 
##     genresScience Fiction            genresThriller 
##                      1.11                      1.17 
##                 genresWar             genresWestern 
##                      1.03                      1.04 
##                popularity 
##                      2.19

The adjusted R-squared increases by 1.0% compared to the best model with only numerical variables.

Prediction

Testing

##      mae     rmse 
## 6.19e+07 1.05e+08

Training

##      mae     rmse 
## 58333154 98791039

A slight improvement in this model: adjusted R-squared increases by 1%, and RMSE slightly decreases on both the training and testing sets.

Model 1: budget + vote + popularity
Model 2: budget + vote + popularity + company + genres

Model 1 has a higher AIC than Model 2, which indicates that Model 2 is better for predicting revenue. Model 1 has a lower BIC than Model 2, which indicates that Model 1 is better as an explanatory model of revenue (BIC prefers simpler models).
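A sketch of this comparison on toy data; AIC() and BIC() accept multiple fitted models, and BIC's log(n) penalty per parameter (versus AIC's 2) is what makes it favor the smaller model:

```r
set.seed(9)
toy <- data.frame(budget = rnorm(200), vote = rnorm(200),
                  genres = factor(sample(c("A", "B", "C", "D"), 200, TRUE)))
toy$revenue <- 2 * toy$budget + toy$vote + rnorm(200)

m1 <- lm(revenue ~ budget + vote, data = toy)           # Model 1 analogue
m2 <- lm(revenue ~ budget + vote + genres, data = toy)  # Model 2 analogue

AIC(m1, m2)  # penalty of 2 per parameter
BIC(m1, m2)  # penalty of log(n) per parameter: favors the simpler model
```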

Regression Tree

With a decision tree we can handle both numerical and categorical variables in the model. Since we are predicting revenue, a continuous response, we use a regression tree.

Tree Construction

Model Construction

We can try two functions, tree() and rpart(), to build a regression tree model.
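A sketch of the rpart() approach on illustrative data (rpart ships with R as a recommended package, while tree() requires installing the separate 'tree' package):

```r
library(rpart)

set.seed(5)
toy <- data.frame(budget = rnorm(300), vote = rnorm(300))
toy$revenue <- ifelse(toy$vote > 0, 100, 10) + rnorm(300)

# method = "anova" requests a regression tree (continuous response)
fit <- rpart(revenue ~ ., data = toy, method = "anova")
printcp(fit)   # the CP table, as shown in the output below
```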

## 
## Regression tree:
## tree(formula = revenue ~ ., data = train1_full)
## Variables actually used in tree construction:
## [1] "vote"   "budget" "genres"
## Number of terminal nodes:  10 
## Residual mean deviance:  9.85e+15 = 2.13e+19 / 2170 
## Distribution of residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -6.75e+08 -3.76e+07 -1.25e+07  0.00e+00  2.52e+07  1.24e+09

## Call:
## rpart(formula = revenue ~ ., data = train1_full, method = "anova")
##   n= 2176 
## 
##       CP nsplit rel error xerror   xstd
## 1 0.3861      0     1.000  1.001 0.1190
## 2 0.1250      1     0.614  0.709 0.0853
## 3 0.0704      2     0.489  0.535 0.0660
## 4 0.0528      3     0.418  0.483 0.0641
## 5 0.0265      4     0.366  0.455 0.0642
## 6 0.0256      5     0.339  0.452 0.0643
## 7 0.0139      6     0.313  0.412 0.0525
## 8 0.0107      7     0.300  0.383 0.0515
## 9 0.0100      8     0.289  0.385 0.0516
## 
## Variable importance
##       vote popularity     budget      score     genres     season 
##         42         22         20          7          6          1 
##    runtime    company 
##          1          1 
## 
## Node number 1: 2176 observations,    complexity param=0.386
##   mean=1.2e+08, MSE=3.53e+16 
##   left son=2 (1947 obs) right son=3 (229 obs)
##   Primary splits:
##       vote       < 0.949    to the left,  improve=0.3860, (0 missing)
##       budget     < 1.17     to the left,  improve=0.3840, (0 missing)
##       popularity < 1.27     to the left,  improve=0.3370, (0 missing)
##       genres     splits as  RRRLLLLRRLLLLLRLLL, improve=0.0995, (0 missing)
##       runtime    < 0.609    to the left,  improve=0.0587, (0 missing)
##   Surrogate splits:
##       popularity < 0.833    to the left,  agree=0.948, adj=0.502, (0 split)
##       budget     < 2.45     to the left,  agree=0.911, adj=0.153, (0 split)
##       score      < 1.79     to the left,  agree=0.902, adj=0.066, (0 split)
## 
## Node number 2: 1947 observations,    complexity param=0.0704
##   mean=8e+07, MSE=9.5e+15 
##   left son=4 (1481 obs) right son=5 (466 obs)
##   Primary splits:
##       vote       < -0.099   to the left,  improve=0.2930, (0 missing)
##       popularity < -0.00948 to the left,  improve=0.2540, (0 missing)
##       budget     < 0.716    to the left,  improve=0.2480, (0 missing)
##       company    splits as  LRRRRR, improve=0.0571, (0 missing)
##       genres     splits as  LRRLLLLRRLLLLLLLLL, improve=0.0563, (0 missing)
##   Surrogate splits:
##       popularity < 0.0996   to the left,  agree=0.922, adj=0.674, (0 split)
##       budget     < 1.09     to the left,  agree=0.791, adj=0.129, (0 split)
##       score      < 2.02     to the left,  agree=0.762, adj=0.004, (0 split)
##       runtime    < 4.4      to the left,  agree=0.761, adj=0.002, (0 split)
## 
## Node number 3: 229 observations,    complexity param=0.125
##   mean=4.61e+08, MSE=1.25e+17 
##   left son=6 (101 obs) right son=7 (128 obs)
##   Primary splits:
##       budget     < 0.739    to the left,  improve=0.335, (0 missing)
##       vote       < 2.5      to the left,  improve=0.269, (0 missing)
##       genres     splits as  RRRLL-LRRLL-LLRLLR, improve=0.210, (0 missing)
##       popularity < 1.8      to the left,  improve=0.176, (0 missing)
##       runtime    < 1.18     to the left,  improve=0.093, (0 missing)
##   Surrogate splits:
##       genres     splits as  RRRLL-LRRLL-LLRLLR, agree=0.790, adj=0.525, (0 split)
##       score      < 1.44     to the right, agree=0.716, adj=0.356, (0 split)
##       vote       < 1.97     to the left,  agree=0.633, adj=0.168, (0 split)
##       popularity < 1.15     to the left,  agree=0.607, adj=0.109, (0 split)
##       season     splits as  RRLR, agree=0.607, adj=0.109, (0 split)
## 
## Node number 4: 1481 observations,    complexity param=0.0139
##   mean=5.04e+07, MSE=3.52e+15 
##   left son=8 (772 obs) right son=9 (709 obs)
##   Primary splits:
##       vote       < -0.497   to the left,  improve=0.2040, (0 missing)
##       budget     < -0.32    to the left,  improve=0.1950, (0 missing)
##       popularity < -0.38    to the left,  improve=0.1820, (0 missing)
##       company    splits as  LRRRRR,       improve=0.0553, (0 missing)
##       runtime    < 0.18     to the left,  improve=0.0165, (0 missing)
##   Surrogate splits:
##       popularity < -0.389   to the left,  agree=0.916, adj=0.825, (0 split)
##       budget     < -0.32    to the left,  agree=0.630, adj=0.227, (0 split)
##       genres     splits as  RLRLRLLLRLRLRLRRRL, agree=0.568, adj=0.097, (0 split)
##       company    splits as  LRLRLL, agree=0.565, adj=0.092, (0 split)
##       score      < -0.423   to the left,  agree=0.554, adj=0.068, (0 split)
## 
## Node number 5: 466 observations,    complexity param=0.0256
##   mean=1.74e+08, MSE=1.69e+16 
##   left son=10 (322 obs) right son=11 (144 obs)
##   Primary splits:
##       budget  < 0.649    to the left,  improve=0.2500, (0 missing)
##       genres  splits as  LLRLL-LRLRLLLLLLLL, improve=0.1220, (0 missing)
##       company splits as  LRRRRR, improve=0.0808, (0 missing)
##       score   < 0.973    to the right, improve=0.0708, (0 missing)
##       vote    < 0.432    to the left,  improve=0.0478, (0 missing)
##   Surrogate splits:
##       genres     splits as  LRRLL-LRLRLLLLLLLL, agree=0.755, adj=0.208, (0 split)
##       score      < -0.423   to the right, agree=0.710, adj=0.063, (0 split)
##       popularity < 1.54     to the left,  agree=0.695, adj=0.014, (0 split)
##       vote       < 0.909    to the left,  agree=0.695, adj=0.014, (0 split)
## 
## Node number 6: 101 observations,    complexity param=0.0107
##   mean=2.3e+08, MSE=3.14e+16 
##   left son=12 (85 obs) right son=13 (16 obs)
##   Primary splits:
##       genres     splits as  LRRLL-L-LLL-RLLLL-, improve=0.2590, (0 missing)
##       vote       < 2.55     to the left,  improve=0.2120, (0 missing)
##       budget     < -0.192   to the left,  improve=0.1350, (0 missing)
##       popularity < 1.81     to the left,  improve=0.0997, (0 missing)
##       season     splits as  RRLR, improve=0.0770, (0 missing)
##   Surrogate splits:
##       runtime < -0.941   to the right, agree=0.851, adj=0.063, (0 split)
##       score   < -0.365   to the right, agree=0.851, adj=0.063, (0 split)
## 
## Node number 7: 128 observations,    complexity param=0.0528
##   mean=6.43e+08, MSE=1.24e+17 
##   left son=14 (71 obs) right son=15 (57 obs)
##   Primary splits:
##       vote       < 2.44     to the left,  improve=0.2550, (0 missing)
##       budget     < 3.78     to the left,  improve=0.2390, (0 missing)
##       popularity < 3.14     to the left,  improve=0.2330, (0 missing)
##       runtime    < 1.13     to the left,  improve=0.0954, (0 missing)
##       score      < -0.772   to the left,  improve=0.0715, (0 missing)
##   Surrogate splits:
##       score      < 0.624    to the left,  agree=0.719, adj=0.368, (0 split)
##       popularity < 1.67     to the left,  agree=0.711, adj=0.351, (0 split)
##       runtime    < 1.13     to the left,  agree=0.656, adj=0.228, (0 split)
##       budget     < 2.57     to the left,  agree=0.641, adj=0.193, (0 split)
##       company    splits as  LLLLRR,       agree=0.633, adj=0.175, (0 split)
## 
## Node number 8: 772 observations
##   mean=2.47e+07, MSE=7.93e+14 
## 
## Node number 9: 709 observations
##   mean=7.84e+07, MSE=4.98e+15 
## 
## Node number 10: 322 observations
##   mean=1.3e+08, MSE=9.82e+15 
## 
## Node number 11: 144 observations
##   mean=2.71e+08, MSE=1.91e+16 
## 
## Node number 12: 85 observations
##   mean=1.91e+08, MSE=1.99e+16 
## 
## Node number 13: 16 observations
##   mean=4.38e+08, MSE=4.15e+16 
## 
## Node number 14: 71 observations
##   mean=4.83e+08, MSE=4.95e+16 
## 
## Node number 15: 57 observations,    complexity param=0.0265
##   mean=8.41e+08, MSE=1.46e+17 
##   left son=30 (47 obs) right son=31 (10 obs)
##   Primary splits:
##       budget     < 3.98     to the left,  improve=0.2450, (0 missing)
##       popularity < 2.88     to the left,  improve=0.2080, (0 missing)
##       vote       < 4.63     to the left,  improve=0.1260, (0 missing)
##       score      < 1.32     to the right, improve=0.0656, (0 missing)
##       genres     splits as  RRR-L-LRL-----R--L, improve=0.0449, (0 missing)
##   Surrogate splits:
##       vote < 7.31     to the left,  agree=0.842, adj=0.1, (0 split)
## 
## Node number 30: 47 observations
##   mean=7.54e+08, MSE=6.86e+16 
## 
## Node number 31: 10 observations
##   mean=1.25e+09, MSE=3.06e+17
## 
## Regression tree:
## rpart(formula = revenue ~ ., data = train1_full, method = "anova")
## 
## Variables actually used in tree construction:
## [1] budget genres vote  
## 
## Root node error: 8e+19/2176 = 4e+16
## 
## n= 2176 
## 
##     CP nsplit rel error xerror xstd
## 1 0.39      0       1.0    1.0 0.12
## 2 0.13      1       0.6    0.7 0.09
## 3 0.07      2       0.5    0.5 0.07
## 4 0.05      3       0.4    0.5 0.06
## 5 0.03      4       0.4    0.5 0.06
## 6 0.03      5       0.3    0.5 0.06
## 7 0.01      6       0.3    0.4 0.05
## 8 0.01      7       0.3    0.4 0.05
## 9 0.01      8       0.3    0.4 0.05

The two methods give the same tree; rpart() provides access to nicer plots.

Prediction

Testing

##      mae     rmse 
## 6.54e+07 1.13e+08

Training

##      mae     rmse 
## 5.93e+07 1.01e+08

The results seem to be worse than linear models.

Pruned Tree

Pruning

We can see the error for each CP value.

## 
## Regression tree:
## rpart(formula = revenue ~ ., data = train1_full, method = "anova")
## 
## Variables actually used in tree construction:
## [1] budget genres vote  
## 
## Root node error: 8e+19/2176 = 4e+16
## 
## n= 2176 
## 
##     CP nsplit rel error xerror xstd
## 1 0.39      0       1.0    1.0 0.12
## 2 0.13      1       0.6    0.7 0.09
## 3 0.07      2       0.5    0.5 0.07
## 4 0.05      3       0.4    0.5 0.06
## 5 0.03      4       0.4    0.5 0.06
## 6 0.03      5       0.3    0.5 0.06
## 7 0.01      6       0.3    0.4 0.05
## 8 0.01      7       0.3    0.4 0.05
## 9 0.01      8       0.3    0.4 0.05

##     1     2     3     4     5     6     7     8     9 
## 1.001 0.709 0.535 0.483 0.455 0.452 0.412 0.383 0.385

The cross-validation error (xerror) is lowest at the 8th CP value.
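A sketch of the pruning step on toy data: pick the CP row with the smallest cross-validated error and prune to it.

```r
library(rpart)

set.seed(5)
toy <- data.frame(x = rnorm(500), z = rnorm(500))
toy$y <- ifelse(toy$x > 0, 50, 5) + ifelse(toy$z > 1, 20, 0) + rnorm(500)
fit <- rpart(y ~ ., data = toy, method = "anova")

# Choose the CP with the smallest cross-validated error, then prune
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```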

Prediction

Testing

##      mae     rmse 
## 6.59e+07 1.15e+08

Training

##      mae     rmse 
## 6.04e+07 1.03e+08

There are no significant changes in the results.

Random Forest

Random Forest Model

The data contain many variables, and we are predicting on a held-out testing set with a model built on the training set. Random Forest (RF) should therefore do better than a single decision tree, which prefers fewer variables and predicts best within the training sample.

Model Construction

## 
## Call:
##  randomForest(formula = revenue ~ ., data = train1_full, ntree = 350) 
##                Type of random forest: regression
##                      Number of trees: 350
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 9.46e+15
##                     % Var explained: 73.2
##            IncNodePurity
## budget          1.90e+19
## popularity      1.43e+19
## runtime         4.15e+18
## score           3.39e+18
## vote            2.33e+19
## genres          5.58e+18
## company         2.60e+18
## season          1.49e+18

Budget, popularity, and vote have the highest importance. These are also the numerical variables selected in our best linear model, so the two models seem to agree.

There is a difference when accommodating categorical variables: the linear model does not select runtime but adds company and genres, while in the Random Forest model runtime is more important than company. In this case, the two models seem to agree only on the genres variable.

MSE by the number of trees

When the number of trees increases, the mean squared error (MSE) decreases. Beyond a certain number of trees (around 100 in our case), the MSE shows no significant change.

Testing

##      mae     rmse 
## 55398015 98520833

Training

##      mae     rmse 
## 52851250 97252127

Random Forest has lower RMSE and MAE than the linear model.

The pseudo R-squared of RF is slightly higher than Adj R-squared of Linear Model.

Hyperparameter Tuning

Let’s see whether we can improve the RF model by tuning its hyperparameters.

## mtry = 2  OOB error = 9.46e+15 
## Searching left ...
## mtry = 1     OOB error = 1.06e+16 
## -0.118 0.05 
## Searching right ...
## mtry = 4     OOB error = 9.33e+15 
## 0.014 0.05

The mtry value giving the lowest OOB error is 4. Now, let’s build an RF model with mtry = 4.
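The search above looks like the output of randomForest::tuneRF; a sketch on toy data (the mtryStart = 2 value here is our own choice for the toy predictors):

```r
library(randomForest)

set.seed(12)
x <- data.frame(a = rnorm(300), b = rnorm(300), c = rnorm(300), d = rnorm(300))
y <- 2 * x$a + x$b + rnorm(300)

# Double/halve mtry from mtryStart, keeping the value with the lowest OOB error
res <- tuneRF(x, y, mtryStart = 2, ntreeTry = 100, stepFactor = 2,
              improve = 0.05, trace = FALSE, plot = FALSE)
best_mtry <- res[which.min(res[, "OOBError"]), "mtry"]
```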

Tuned RF model

Model Construction

## 
## Call:
##  randomForest(formula = revenue ~ ., data = train1_full, mtry = 4,      ntree = 350) 
##                Type of random forest: regression
##                      Number of trees: 350
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 9.17e+15
##                     % Var explained: 74

There is an improvement in the pseudo R-squared: the % variance explained rises from 73.2% to 74%, and the OOB mean of squared residuals drops from 9.46e+15 to 9.17e+15.

Prediction

Testing

##     mae    rmse 
## 5.5e+07 9.7e+07

Training

##     mae    rmse 
## 5.2e+07 9.6e+07

We can see decreases in both MAE and RMSE after tuning the forest.

Summary

Model                         Linear Model*   Regression Tree   Random Forest
R-squared (adjusted/pseudo)   0.721           0.711             0.741
MAE  - train                  5.833e+07       6.044e+07         5.2e+07
MAE  - test                   6.188e+07       6.595e+07         5.5e+07
RMSE - train                  9.879e+07       1.029e+08         9.6e+07
RMSE - test                   1.045e+08       1.150e+08         9.7e+07

* Linear Model: budget + vote + popularity + company + genres

Chapter 4. Principal Component Analysis

We want to examine whether Principal Component Analysis is good at dimensionality reduction. In our data, we have 5 continuous variables to predict the revenue. We will perform principal component regression directly and see how many components are sufficient for the model.

In this part, we will also justify scaling our data in the previous chapter by comparing two versions of the PCR model: centered data and non-centered data.

Variance

We will also include revenue when checking variance.

Non-centered data

## Importance of components:
##                             PC1      PC2 PC3  PC4  PC5   PC6
## Standard deviation     1.89e+08 3.10e+07 926 23.9 20.1 0.699
## Proportion of Variance 9.74e-01 2.62e-02   0  0.0  0.0 0.000
## Cumulative Proportion  9.74e-01 1.00e+00   1  1.0  1.0 1.000

Centered data

## Importance of components:
##                          PC1   PC2   PC3    PC4    PC5    PC6
## Standard deviation     1.768 1.108 0.894 0.6466 0.5045 0.4186
## Proportion of Variance 0.521 0.204 0.133 0.0697 0.0424 0.0292
## Cumulative Proportion  0.521 0.725 0.859 0.9284 0.9708 1.0000

We see significant differences in the variances and standard deviations of the components. This is because revenue and the predictors other than budget are measured on different scales: revenue and budget are measured in the hundreds of millions, far larger than the scales of vote, score, popularity, etc.

Therefore, we recommend scaling the data before constructing the model. We will see the differences between non-centered and centered data in the following models.

PVE

In the non-centered version, 1 component explains almost 100% of the variance; in the centered version, 1 component explains only about 50%, and we need 3 components to reach 80%. This is further evidence that we should scale: revenue and budget overwhelm the components in the non-centered version, causing PC1 to capture nearly the entire variance.
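The effect is easy to reproduce in base R; here mtcars plays the role of our data, with its large-scale disp column overwhelming the unscaled PCA in the same way budget and revenue do:

```r
# Compare PCA with and without column scaling (mtcars as a stand-in).
p_raw    <- prcomp(mtcars)                 # centered, but not scaled
p_scaled <- prcomp(mtcars, scale. = TRUE)  # standardized columns
summary(p_raw)$importance["Proportion of Variance", "PC1"]    # ~0.93: disp/hp dominate
summary(p_scaled)$importance["Proportion of Variance", "PC1"] # far smaller
```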

Model with PCA

PCR Model

Validation

Non-centered version

Centered version

Both versions agree that 2 components give the best trade-off in MSEP and R².

We can see the coefficients of different components.

We can see that after PC2, the changes in the coefficients are not significant compared with the difference between the coefficients of PC1 and PC2. Hence, we can use PC1 and PC2 to optimize the model, since two components are sufficient to capture the variance.
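A sketch of the PCR workflow with the pls package (mtcars stands in here; the report's model regresses revenue on the scaled numerical predictors):

```r
# Principal-component regression with 10-fold CV (mtcars as a stand-in).
library(pls)
set.seed(1)
fit <- pcr(mpg ~ ., data = mtcars, scale = TRUE, validation = "CV")
summary(fit)           # CV RMSEP by number of components
coef(fit, ncomp = 2)   # coefficients when only 2 components are used
```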

Summary

We can make a comparison between the non-centered and centered versions.

Non-centered version

## Data:    X dimension: 2176 5 
##  Y dimension: 2176 1
## Fit method: svdpc
## Number of components considered: 5
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)    1 comps    2 comps    3 comps    4 comps    5 comps
## CV        1.88e+08  122426183  109504309  107691445  103280300  104020634
## adjCV     1.88e+08  122348421  109403985  107658736  103203256  103846041
## 
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps
## X          49.00    71.71    87.31    95.63   100.00
## revenue    57.87    66.61    67.94    70.47    71.13

Centered version

## Data:    X dimension: 2176 5 
##  Y dimension: 2176 1
## Fit method: svdpc
## Number of components considered: 5
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)    1 comps    2 comps    3 comps    4 comps    5 comps
## CV        1.88e+08  122115877  106538623  105948677  103446210  102956254
## adjCV     1.88e+08  122008326  106436756  105914036  103383690  102847338
## 
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps
## X          48.50    71.69    87.41    95.61   100.00
## revenue    58.21    68.09    68.67    70.49    71.13

The scaled data explain more of the variance in revenue than the non-scaled data.

Prediction

Let’s try the pcr model on testing data. We will use the centered version.

There is a large jump in the variance explained from PC1 to PC2; after that, the change is not drastic. We can say that 2 principal components capture the majority of the variance in the testing data, so our PCR model seems to perform properly there.

Linear Model

We can use principal components as the predictors for a linear model. We will use two components to build the models with two versions: centered and non-centered.

Non-centered version

## 
## Call:
## lm(formula = revenue ~ PC1 + PC2, data = movie_pcr.nc)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.20e+09 -4.08e+07 -5.95e+06  2.28e+07  1.76e+09 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 121493614    2330120    52.1   <2e-16 ***
## PC1         -89813816    1463518   -61.4   <2e-16 ***
## PC2          51258272    2149532    23.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.09e+08 on 2173 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.666 
## F-statistic: 2.17e+03 on 2 and 2173 DF,  p-value: <2e-16
## PC1 PC2 
##   1   1

Centered version

## 
## Call:
## lm(formula = revenue ~ PC1 + PC2, data = movie_pcr)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.11e+09 -4.03e+07 -4.95e+06  2.42e+07  1.73e+09 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 120012282    2277573    52.7   <2e-16 ***
## PC1         -92108152    1462914   -63.0   <2e-16 ***
## PC2          54901946    2115458    25.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.06e+08 on 2173 degrees of freedom
## Multiple R-squared:  0.681,  Adjusted R-squared:  0.681 
## F-statistic: 2.32e+03 on 2 and 2173 DF,  p-value: <2e-16
## PC1 PC2 
##   1   1

All VIFs are 1, a nice feature of the PCA method: the principal components are orthogonal, so there is no multicollinearity among the predictors.
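This is by construction: principal components are pairwise uncorrelated, which is why every VIF equals exactly 1. A quick base-R check (three mtcars columns as a stand-in):

```r
# The score matrix of a PCA has zero pairwise correlations.
pc <- prcomp(mtcars[, c("disp", "hp", "wt")], scale. = TRUE)$x
round(cor(pc), 10)   # identity matrix up to rounding
```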

The scaled version gives better results than the non-scaled version.

We can also use AIC and BIC for validation between non-scaled and scaled versions.

## [1] 86611
## [1] 86633
## [1] 86710
## [1] 86732

Both AIC and BIC agree that the scaled version is better.

Conclusion

Compared with the linear model (of numerical variables) in the previous chapter, the Adjusted R-squared decreases slightly, from 71.1% to 68.1%. This is expected, since the goal of PCA is to reduce dimensionality: instead of 5 variables (budget, popularity, vote, score, runtime), we need only 2 (PC1 and PC2). We can speed up the computation without significantly hurting the performance of the model.

Chapter 5. Profit Prediction

In this chapter, we use different kinds of models to predict whether a movie earns a profit (box office revenue > budget). In our data, the column profitable records this: a movie with a profit is labeled 1, otherwise 0.
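Deriving the label is a one-liner; a toy base-R sketch (the report's actual columns are revenue, budget and profitable):

```r
# Toy values: a movie is profitable when revenue exceeds budget.
budget     <- c(100, 250, 50)
revenue    <- c(300, 200, 80)
profitable <- as.integer(revenue > budget)
profitable   # 1 0 1
```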

Prior Chi-squared test

Genres

Company

Season

  • Low p-values: there is evidence that profitable depends on season, company and genres.

Logistic Regression

In this part we will construct the logit model on the whole dataset.

Budget and revenue alone are enough to determine profitable, since profit is simply revenue minus budget. Our pre-test with “bestglm” shows the same result.

##   revenue budget popularity runtime score  vote genres company season
## 1    TRUE   TRUE      FALSE   FALSE FALSE FALSE  FALSE   FALSE  FALSE
## 2    TRUE   TRUE      FALSE   FALSE FALSE FALSE  FALSE   FALSE   TRUE
## 3    TRUE   TRUE      FALSE    TRUE FALSE FALSE  FALSE   FALSE  FALSE
## 4    TRUE   TRUE       TRUE   FALSE FALSE FALSE  FALSE   FALSE   TRUE
## 5    TRUE   TRUE      FALSE   FALSE FALSE  TRUE  FALSE   FALSE   TRUE
##   Criterion
## 1      17.4
## 2      17.6
## 3      18.1
## 4      18.3
## 5      18.5

However, since the relationship between (revenue + budget) and profitable is so direct, we should not use them together.

In practice, we prefer budget rather than revenue for predicting profit. A film manager would want a profit prediction before a movie's release date. The information available at that point is the budget, runtime, genres, production company, popularity, vote and score (vote and score can be obtained from a preview screening; popularity can be generated by advertisements, trailers and leaks). Revenue should therefore play the role of a response in the model rather than a predictor.

Model Construction

Let’s try the model with budget and other predictors

## 
## Call:
## glm(formula = y ~ ., family = "binomial", data = train3)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.683   0.000   0.297   0.735   1.743  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -2.05e+00   5.35e-01   -3.84  0.00012 ***
## budget                    -1.75e-08   2.51e-09   -6.99  2.8e-12 ***
## popularity                 1.61e-02   1.26e-02    1.28  0.20159    
## runtime                    1.66e-03   3.28e-03    0.50  0.61421    
## score                      2.27e-01   8.46e-02    2.68  0.00739 ** 
## vote                       2.67e-03   4.49e-04    5.95  2.7e-09 ***
## genresAdventure            1.73e-01   2.51e-01    0.69  0.49015    
## genresAnimation            2.70e-01   3.97e-01    0.68  0.49646    
## genresComedy               4.31e-01   1.91e-01    2.26  0.02360 *  
## genresCrime                7.35e-02   2.96e-01    0.25  0.80411    
## genresDocumentary          4.66e-01   5.25e-01    0.89  0.37479    
## genresDrama                7.25e-02   1.91e-01    0.38  0.70424    
## genresFamily               2.31e-01   5.54e-01    0.42  0.67656    
## genresFantasy              2.28e-01   4.32e-01    0.53  0.59765    
## genresHistory              6.55e-01   7.13e-01    0.92  0.35873    
## genresHorror               7.92e-01   3.16e-01    2.51  0.01212 *  
## genresMusic                2.74e-01   7.09e-01    0.39  0.69916    
## genresMystery             -4.23e-01   6.33e-01   -0.67  0.50393    
## genresRomance              8.50e-01   4.26e-01    2.00  0.04590 *  
## genresScience Fiction      2.64e-01   5.05e-01    0.52  0.60148    
## genresThriller            -1.50e-01   3.29e-01   -0.46  0.64724    
## genresWar                 -1.36e+00   8.29e-01   -1.64  0.10019    
## genresWestern              2.27e+00   1.08e+00    2.10  0.03582 *  
## companyParamount Pictures  9.81e-01   2.43e-01    4.03  5.5e-05 ***
## companySony Pictures       6.20e-01   2.22e-01    2.79  0.00524 ** 
## companyUniversal Pictures  8.52e-01   2.34e-01    3.64  0.00027 ***
## companyWalt Disney         8.60e-01   1.84e-01    4.67  3.0e-06 ***
## companyWarner Bros         7.69e-01   2.45e-01    3.13  0.00172 ** 
## seasonSummer               3.90e-01   1.74e-01    2.24  0.02505 *  
## seasonFall                -4.65e-02   1.62e-01   -0.29  0.77361    
## seasonWinter               7.04e-02   1.69e-01    0.42  0.67734    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2409.2  on 2175  degrees of freedom
## Residual deviance: 1805.0  on 2145  degrees of freedom
## AIC: 1867
## 
## Number of Fisher Scoring iterations: 8

We can test the effects of different genres/companies/seasons on the prediction to see whether they are significant.

  • Genres
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 23.8, df = 17, P(> X2) = 0.12

It seems that different genres do not have significant effects on the response in our logit model.

  • Company
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 46.2, df = 5, P(> X2) = 8.4e-09

The effects of different companies are significant.

  • Season
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 8.0, df = 3, P(> X2) = 0.045

The effects of different seasons are significant, but not as clear as the effects of different companies.

Feature Selection

##   budget popularity runtime score vote genres company season Criterion
## 1   TRUE      FALSE   FALSE  TRUE TRUE  FALSE    TRUE   TRUE      1855
## 2   TRUE       TRUE   FALSE  TRUE TRUE  FALSE    TRUE   TRUE      1856
## 3   TRUE      FALSE    TRUE  TRUE TRUE  FALSE    TRUE   TRUE      1857
## 4   TRUE       TRUE    TRUE  TRUE TRUE  FALSE    TRUE   TRUE      1858
## 5   TRUE      FALSE   FALSE  TRUE TRUE  FALSE    TRUE  FALSE      1858
  • Best model : budget + score + vote + company + season

Best Logit Model

Build the best model

## 
## Call:
## glm(formula = y ~ budget + score + vote + company + season, family = "binomial", 
##     data = train3)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.733   0.000   0.306   0.763   1.770  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -1.44e+00   4.57e-01   -3.15  0.00163 ** 
## budget                    -1.83e-08   2.30e-09   -7.95  1.9e-15 ***
## score                      2.10e-01   7.18e-02    2.93  0.00340 ** 
## vote                       3.18e-03   2.37e-04   13.43  < 2e-16 ***
## companyParamount Pictures  9.64e-01   2.40e-01    4.02  5.7e-05 ***
## companySony Pictures       5.48e-01   2.19e-01    2.50  0.01237 *  
## companyUniversal Pictures  8.42e-01   2.30e-01    3.67  0.00024 ***
## companyWalt Disney         8.75e-01   1.80e-01    4.86  1.2e-06 ***
## companyWarner Bros         7.88e-01   2.41e-01    3.27  0.00109 ** 
## seasonSummer               3.84e-01   1.72e-01    2.24  0.02513 *  
## seasonFall                -6.93e-02   1.59e-01   -0.43  0.66375    
## seasonWinter               5.37e-02   1.66e-01    0.32  0.74708    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2409.2  on 2175  degrees of freedom
## Residual deviance: 1832.9  on 2164  degrees of freedom
## AIC: 1857
## 
## Number of Fisher Scoring iterations: 7

Model Evaluation

We can validate the model on the testing set with the following methods:

Hosmer and Lemeshow test

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  test3$y, prof_glm_pred
## X-squared = 1049, df = 8, p-value <2e-16

The p-value is very low, so the Hosmer and Lemeshow test rejects the hypothesis of a good fit: the predicted probabilities are not well calibrated on the testing set.

ROC curve and AUC

## Area under the curve: 0.849

The area under the curve is above 0.80, so the model discriminates profitable from unprofitable movies reasonably well.

McFadden

##       llh   llhNull        G2  McFadden      r2ML      r2CU 
##  -916.473 -1204.608   576.271     0.239     0.233     0.348

About 23.9% of the variance in y is explained by the predictors in our model (McFadden pseudo R-squared); not great, but acceptable.
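McFadden's pseudo R-squared is 1 minus the ratio of the model's log-likelihood to the null (intercept-only) model's; a self-contained sketch, with mtcars' am as a stand-in binary response:

```r
# McFadden pseudo R^2 = 1 - logLik(model) / logLik(null model).
fit  <- glm(am ~ wt + hp, family = binomial, data = mtcars)
null <- glm(am ~ 1,       family = binomial, data = mtcars)
1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
```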

Confusion Matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 248 416
##          1  12 373
##                                         
##                Accuracy : 0.592         
##                  95% CI : (0.562, 0.622)
##     No Information Rate : 0.752         
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0.28          
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.954         
##             Specificity : 0.473         
##          Pos Pred Value : 0.373         
##          Neg Pred Value : 0.969         
##              Prevalence : 0.248         
##          Detection Rate : 0.236         
##    Detection Prevalence : 0.633         
##       Balanced Accuracy : 0.713         
##                                         
##        'Positive' Class : 0             
## 
##   threshold accuracy
## 1       0.5     81.9
## 2       0.6     80.7
## 3       0.7     77.4
## 4       0.8     69.0
## 5       0.9     59.2
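The threshold/accuracy table above can be produced by sweeping the classification cutoff over the predicted probabilities; a base-R sketch with toy values (not the report's data):

```r
# Accuracy when classifying "profitable" at probability cutoff t.
acc_at <- function(p, y, t) mean(as.integer(p > t) == y)

p <- c(0.9, 0.8, 0.3, 0.6)   # toy predicted probabilities
y <- c(1, 1, 0, 0)           # toy labels
sapply(c(0.5, 0.6, 0.7, 0.8, 0.9), function(t) acc_at(p, y, t))
# 0.75 1.00 1.00 0.75 0.50 -- accuracy peaks, then falls as the cutoff rises
```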

Classification Tree

Model Construction

We can visualize our tree

Pruned Tree model

We can optimize our tree by pruning (in complex model, pruning helps to reduce overfitting).

## 
## Classification tree:
## rpart(formula = y ~ ., data = train3, method = "class")
## 
## Variables actually used in tree construction:
## [1] budget  company genres  vote   
## 
## Root node error: 527/2176 = 0.2
## 
## n= 2176 
## 
##     CP nsplit rel error xerror xstd
## 1 0.08      0       1.0    1.0 0.04
## 2 0.03      2       0.8    0.9 0.04
## 3 0.02      4       0.8    0.9 0.04
## 4 0.01      7       0.7    0.8 0.04
## 5 0.01      8       0.7    0.8 0.04

The cross-validated error (xerror) is lowest from the fourth row of the CP table (CP = 0.01), with 7 splits.

Let’s see our tree after pruning.

The last split was pruned.
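Pruning picks the CP value with the lowest cross-validated error from the table above; a sketch using rpart's bundled kyphosis data standing in for train3:

```r
# Fit a classification tree, then prune at the best cross-validated CP.
library(rpart)
set.seed(1)
fit     <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```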

Model Evaluation

Accuraccy

##       predicted
## actual   0   1
##      0  88 172
##      1  35 754
## [1] 0.803

ROC curve and AUC

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  88  35
##          1 172 754
##                                         
##                Accuracy : 0.803         
##                  95% CI : (0.777, 0.826)
##     No Information Rate : 0.752         
##     P-Value [Acc > NIR] : 6.09e-05      
##                                         
##                   Kappa : 0.357         
##                                         
##  Mcnemar's Test P-Value : < 2e-16       
##                                         
##             Sensitivity : 0.3385        
##             Specificity : 0.9556        
##          Pos Pred Value : 0.7154        
##          Neg Pred Value : 0.8143        
##              Prevalence : 0.2479        
##          Detection Rate : 0.0839        
##    Detection Prevalence : 0.1173        
##       Balanced Accuracy : 0.6471        
##                                         
##        'Positive' Class : 0             
## 
## Area under the curve: 0.764

The area under the curve is 0.764 (below 0.8). The classification tree does not seem to be as good as logistic regression in this case.

Chapter 6. Season, Company, Genres Prediction

We will use the KNN model in this part.
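A sketch of the KNN workflow with the class package, using iris as a stand-in: the predictors are scaled first (as in the report), and k is chosen by held-out accuracy:

```r
# Scale, split, then evaluate knn() over a grid of k values.
library(class)
set.seed(1)
X   <- scale(iris[, 1:4])
idx <- sample(nrow(iris), 100)
accs <- sapply(1:40, function(k)
  mean(knn(X[idx, ], X[-idx, ], cl = iris$Species[idx], k = k)
       == iris$Species[-idx]))
which.max(accs)   # k giving the best held-out accuracy
```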

Season

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1049 
## 
##  
##              | season_knn 
## test2$season |    Spring |    Summer |      Fall |    Winter | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##       Spring |        40 |        63 |        64 |        47 |       214 | 
##              |     0.187 |     0.294 |     0.299 |     0.220 |     0.204 | 
##              |     0.189 |     0.243 |     0.190 |     0.195 |           | 
##              |     0.038 |     0.060 |     0.061 |     0.045 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##       Summer |        75 |        84 |        63 |        59 |       281 | 
##              |     0.267 |     0.299 |     0.224 |     0.210 |     0.268 | 
##              |     0.354 |     0.324 |     0.187 |     0.245 |           | 
##              |     0.071 |     0.080 |     0.060 |     0.056 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##         Fall |        49 |        72 |       125 |        81 |       327 | 
##              |     0.150 |     0.220 |     0.382 |     0.248 |     0.312 | 
##              |     0.231 |     0.278 |     0.371 |     0.336 |           | 
##              |     0.047 |     0.069 |     0.119 |     0.077 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##       Winter |        48 |        40 |        85 |        54 |       227 | 
##              |     0.211 |     0.176 |     0.374 |     0.238 |     0.216 | 
##              |     0.226 |     0.154 |     0.252 |     0.224 |           | 
##              |     0.046 |     0.038 |     0.081 |     0.051 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
## Column Total |       212 |       259 |       337 |       241 |      1049 | 
##              |     0.202 |     0.247 |     0.321 |     0.230 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
## 
## 

k = 35 gives the best accuracy.

Genres

There are many genres, so we won’t show the cross table.

k = 27 gives the best accuracy

Company

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1049 
## 
##  
##                    | company_knn 
##      test2$company |             Others | Paramount Pictures |      Sony Pictures | Universal Pictures |        Walt Disney |        Warner Bros |          Row Total | 
## -------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
##             Others |                447 |                 10 |                 10 |                 17 |                 48 |                  2 |                534 | 
##                    |              0.837 |              0.019 |              0.019 |              0.032 |              0.090 |              0.004 |              0.509 | 
##                    |              0.548 |              0.370 |              0.312 |              0.395 |              0.381 |              0.400 |                    | 
##                    |              0.426 |              0.010 |              0.010 |              0.016 |              0.046 |              0.002 |                    | 
## -------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
## Paramount Pictures |                 67 |                  2 |                  1 |                  2 |                 14 |                  0 |                 86 | 
##                    |              0.779 |              0.023 |              0.012 |              0.023 |              0.163 |              0.000 |              0.082 | 
##                    |              0.082 |              0.074 |              0.031 |              0.047 |              0.111 |              0.000 |                    | 
##                    |              0.064 |              0.002 |              0.001 |              0.002 |              0.013 |              0.000 |                    | 
## -------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
##      Sony Pictures |                 62 |                  5 |                  3 |                  9 |                  9 |                  0 |                 88 | 
##                    |              0.705 |              0.057 |              0.034 |              0.102 |              0.102 |              0.000 |              0.084 | 
##                    |              0.076 |              0.185 |              0.094 |              0.209 |              0.071 |              0.000 |                    | 
##                    |              0.059 |              0.005 |              0.003 |              0.009 |              0.009 |              0.000 |                    | 
## -------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
## Universal Pictures |                 81 |                  3 |                  9 |                  7 |                 17 |                  0 |                117 | 
##                    |              0.692 |              0.026 |              0.077 |              0.060 |              0.145 |              0.000 |              0.112 | 
##                    |              0.099 |              0.111 |              0.281 |              0.163 |              0.135 |              0.000 |                    | 
##                    |              0.077 |              0.003 |              0.009 |              0.007 |              0.016 |              0.000 |                    | 
## -------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
##        Walt Disney |                109 |                  7 |                  7 |                  7 |                 27 |                  2 |                159 | 
##                    |              0.686 |              0.044 |              0.044 |              0.044 |              0.170 |              0.013 |              0.152 | 
##                    |              0.134 |              0.259 |              0.219 |              0.163 |              0.214 |              0.400 |                    | 
##                    |              0.104 |              0.007 |              0.007 |              0.007 |              0.026 |              0.002 |                    | 
## -------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
##        Warner Bros |                 50 |                  0 |                  2 |                  1 |                 11 |                  1 |                 65 | 
##                    |              0.769 |              0.000 |              0.031 |              0.015 |              0.169 |              0.015 |              0.062 | 
##                    |              0.061 |              0.000 |              0.062 |              0.023 |              0.087 |              0.200 |                    | 
##                    |              0.048 |              0.000 |              0.002 |              0.001 |              0.010 |              0.001 |                    | 
## -------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
##       Column Total |                816 |                 27 |                 32 |                 43 |                126 |                  5 |               1049 | 
##                    |              0.778 |              0.026 |              0.031 |              0.041 |              0.120 |              0.005 |                    | 
## -------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
## 
## 

k = 23 gives the best accuracy

Chapter 7. Revenue by Season

We use time series models in this chapter.

Time Series

Visualization

Trend and Seasonality

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.55e+09 -6.35e+08  2.92e+08  0.00e+00  9.27e+08  9.67e+08
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## -1.05e+09 -4.67e+08 -8.91e+07 -3.50e+06  4.29e+08  1.99e+09         4
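The seasonal and trend summaries above come from a classical decomposition; a base-R sketch on the built-in quarterly UKgas series as a stand-in (our revenue-by-season series is also quarterly):

```r
# Classical decomposition into trend, seasonal and random components.
dec <- decompose(UKgas)
summary(dec$seasonal)
summary(dec$trend)   # NA's at both ends from the centered moving average
```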

HoltWinters Method

Forecasting

Prediction on the future
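A minimal Holt-Winters sketch with base stats, again using the built-in quarterly UKgas series as a stand-in for our revenue-by-season data:

```r
# Fit level, trend and seasonal smoothing, then forecast ahead.
hw <- HoltWinters(UKgas)          # additive trend + seasonal by default
fc <- predict(hw, n.ahead = 8)    # forecast the next 8 quarters
fc
```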

Model Evaluation

We evaluate the model on the testing data.

Better visualization with highcharter

ARIMA Model

Forecasting
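A seasonal ARIMA sketch in base stats on the built-in quarterly UKgas series (the report may instead use forecast::auto.arima; the (1,1,1)(0,1,1)[4] order here is only illustrative):

```r
# Seasonal ARIMA with quarterly period, then an 8-step-ahead forecast.
fit <- arima(UKgas, order = c(1, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 4))
fc  <- predict(fit, n.ahead = 8)
fc$pred   # point forecasts for the next 8 quarters
```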

Model Evaluation

We can also produce a nicer plot with highcharter.